1 00:00:09,784 --> 00:00:11,867 - Okay let's get started. 2 00:00:13,038 --> 00:00:15,888 Alright, so welcome to lecture 14, 3 00:00:15,888 --> 00:00:20,884 and today we'll be talking about reinforcement learning. 4 00:00:20,884 --> 00:00:23,222 So some administrative details first, 5 00:00:23,222 --> 00:00:30,346 an update on grades. Midterm grades were released last night, so see Piazza for more information and statistics about that. 6 00:00:30,346 --> 00:00:35,402 And we also have A2 and milestone grades scheduled for later this week. 7 00:00:36,768 --> 00:00:40,682 Also, about your projects, all teams must register your projects. 8 00:00:40,682 --> 00:00:47,580 So on Piazza we have a form posted, so you should go there, and this is required, every team should go and fill out this form with information 9 00:00:47,580 --> 00:00:53,214 about your project, that we'll use for final grading and the poster session. 10 00:00:53,214 --> 00:01:01,779 And the Tiny ImageNet evaluation servers are also now online for those of you who are doing the Tiny ImageNet challenge. 11 00:01:01,779 --> 00:01:06,193 We also have a link to a course survey on Piazza that was released a few days ago, 12 00:01:06,193 --> 00:01:13,600 so, please fill it out if you guys haven't already. We'd love to have your feedback and know how we can improve this class. 13 00:01:16,589 --> 00:01:19,650 Okay, so the topic of today, reinforcement learning. 14 00:01:19,650 --> 00:01:22,544 Alright, so, so far we've talked about supervised learning, 15 00:01:22,544 --> 00:01:30,498 which is a type of problem where we have data x and we have labels y, and our goal is to learn a function mapping from x to y. 16 00:01:30,498 --> 00:01:35,067 So, for example, the classification problem that we've been working with. 17 00:01:35,067 --> 00:01:37,753 We also talked last lecture about unsupervised learning, 18 00:01:37,753 --> 00:01:45,362 which is the problem where we have just data and no labels, and our goal is to learn some underlying, hidden structure of the data. 19 00:01:45,362 --> 00:01:50,528 So, an example of this is the generative models that we talked about last lecture. 20 00:01:52,040 --> 00:01:57,370 And so today we're going to talk about a different kind of problem set-up, the reinforcement learning problem. 21 00:01:57,370 --> 00:02:01,824 And so here we have an agent that can take actions in its environment, 22 00:02:01,824 --> 00:02:04,352 and it can receive rewards for its actions. 23 00:02:04,352 --> 00:02:09,959 And its goal is going to be to learn how to take actions in a way that can maximize its reward. 24 00:02:09,959 --> 00:02:14,101 And so we'll talk about this in a lot more detail today. 25 00:02:14,101 --> 00:02:18,116 So, the outline for today, we're going to first talk about the reinforcement learning problem, 26 00:02:18,116 --> 00:02:20,927 and then we'll talk about Markov decision processes, 27 00:02:20,927 --> 00:02:24,747 which is a formalism of the reinforcement learning problem, 28 00:02:24,747 --> 00:02:31,095 and then we'll talk about two major classes of RL algorithms, Q-learning and policy gradients. 29 00:02:32,876 --> 00:02:38,936 So, in the reinforcement learning set-up, what we have is we have an agent and we have an environment. 30 00:02:38,936 --> 00:02:43,268 And so the environment gives the agent a state.
31 00:02:43,268 --> 00:02:46,877 In turn, the agent is going to take an action, 32 00:02:46,877 --> 00:02:52,609 and then the environment is going to give back a reward, as well as the next state. 33 00:02:52,609 --> 00:03:00,918 And so this is going to keep going on in this loop, on and on, until the environment gives back a terminal state, which then ends the episode. 34 00:03:00,918 --> 00:03:03,401 So, let's see some examples of this. 35 00:03:03,401 --> 00:03:05,536 First we have here the cart-pole problem, 36 00:03:05,536 --> 00:03:11,142 which is a classic problem that some of you may have seen, in, for example, 229 before. 37 00:03:11,142 --> 00:03:16,252 And so the objective here is that you want to balance a pole on top of a movable cart. 38 00:03:16,252 --> 00:03:20,280 Alright, so the state that you have here is your current description of the system. 39 00:03:20,280 --> 00:03:28,206 So, for example, the angle and angular speed of your pole, your position, and the horizontal velocity of your cart. 40 00:03:28,206 --> 00:03:33,224 And the actions you can take are horizontal forces that you apply onto the cart, right? 41 00:03:33,224 --> 00:03:38,387 So you're basically trying to move this cart around to try and balance this pole on top of it. 42 00:03:38,387 --> 00:03:43,990 And the reward that you're getting from this environment is one at each time step if your pole is upright. 43 00:03:43,990 --> 00:03:48,143 So you basically want to keep this pole balanced for as long as you can. 44 00:03:49,286 --> 00:03:52,192 Okay, so here's another example of a classic RL problem. 45 00:03:52,192 --> 00:03:53,998 Here is robot locomotion. 46 00:03:53,998 --> 00:03:59,670 So we have here an example of a humanoid robot, as well as an ant robot model. 47 00:03:59,670 --> 00:04:03,128 And our objective here is to make the robot move forward. 48 00:04:03,128 --> 00:04:10,807 And so the state that we have describing our system is the angles and positions of all the joints of our robots. 49 00:04:10,807 --> 00:04:15,887 And then the actions that we can take are the torques applied onto these joints, 50 00:04:15,887 --> 00:04:21,228 right, and so these are trying to make the robot move forward and then the reward that we get is 51 00:04:21,228 --> 00:04:31,701 our forward movement, as well as, in the case of the humanoid, something like a reward of one for each time step that this robot is upright. 52 00:04:33,521 --> 00:04:38,384 So, games are also a big class of problems that can be formulated with RL. 53 00:04:38,384 --> 00:04:40,700 So, for example, here we have Atari games, 54 00:04:40,700 --> 00:04:44,280 which are a classic success of deep reinforcement learning, 55 00:04:44,280 --> 00:04:48,574 and so here the objective is to complete these games with the highest possible score, right. 56 00:04:48,574 --> 00:04:52,753 So, your agent is basically a player that's trying to play these games. 57 00:04:52,753 --> 00:04:57,506 And the state that you have is going to be the raw pixels of the game state. 58 00:04:57,506 --> 00:05:02,882 Right, so these are just the pixels on the screen that you would see as you're playing the game. 59 00:05:02,882 --> 00:05:09,912 And then the actions that you have are your game controls, so for example, in some games maybe moving left to right, up or down.
60 00:05:09,912 --> 00:05:15,667 And then the reward that you have is your score increase or decrease at each time step, and your goal is going to be 61 00:05:15,667 --> 00:05:27,572 to maximize your total score over the course of the game. And, finally, here we have another example of a game. It's Go, which is 62 00:05:27,573 --> 00:05:31,697 something that was a huge achievement of deep reinforcement learning last year, 63 00:05:31,697 --> 00:05:38,588 when DeepMind's AlphaGo beat Lee Sedol, who is one of the best Go players of the last few years, 64 00:05:38,589 --> 00:05:50,918 and this is actually in the news again, as some of you may have seen, there's another Go competition going on now with AlphaGo versus a top-ranked Go player. 65 00:05:50,919 --> 00:05:56,295 And so the objective here is to win the game, and our state is the position 66 00:05:56,295 --> 00:06:03,911 of all the pieces, the action is where to put the next piece down, and the reward is one if you win at the end of the game, and zero otherwise. 67 00:06:03,912 --> 00:06:08,411 And we'll also talk about this one in a little bit more detail, later. 68 00:06:08,411 --> 00:06:13,329 Okay, so how can we mathematically formalize the RL problem, right? 69 00:06:13,330 --> 00:06:18,051 This loop that we talked about earlier, of environments giving agents states, 70 00:06:18,051 --> 00:06:20,634 and then agents taking actions. 71 00:06:22,394 --> 00:06:28,512 So, a Markov decision process is the mathematical formulation of the RL problem, 72 00:06:28,512 --> 00:06:31,447 and an MDP satisfies the Markov property, 73 00:06:31,447 --> 00:06:36,107 which is that the current state completely characterizes the state of the world. 74 00:06:36,107 --> 00:06:40,164 And an MDP here is defined by a tuple of objects, 75 00:06:40,164 --> 00:06:43,170 consisting of S, which is the set of possible states. 76 00:06:43,170 --> 00:06:45,762 We have A, our set of possible actions, 77 00:06:45,762 --> 00:06:51,694 we also have R, our distribution of reward, given a state, action pair, 78 00:06:51,694 --> 00:06:55,323 so it's a function mapping from a state, action pair to your reward. 79 00:06:55,323 --> 00:06:57,430 You also have P, which is a transition probability 80 00:06:57,430 --> 00:07:02,940 distribution over the next state that you're going to transition to, given your state, action pair. 81 00:07:02,940 --> 00:07:05,718 And then finally we have gamma, a discount factor, 82 00:07:05,718 --> 00:07:12,970 which is basically saying how much we value rewards coming up soon versus later on. 83 00:07:14,203 --> 00:07:17,395 So, the way the Markov Decision Process works is that 84 00:07:17,395 --> 00:07:20,053 at our initial time step t equals zero, 85 00:07:20,053 --> 00:07:26,362 the environment is going to sample some initial state s-zero, from the initial state distribution, p of s-zero. 86 00:07:26,363 --> 00:07:32,253 And then, once it has that, from time t equals zero until done, we're going to iterate through this loop 87 00:07:32,253 --> 00:07:35,797 where the agent is going to select an action, a sub t. 88 00:07:35,797 --> 00:07:38,885 The environment is going to sample a reward from here, 89 00:07:38,885 --> 00:07:44,032 so a reward given your state and the action that you just took.
90 00:07:44,032 --> 00:07:51,534 It's also going to sample the next state, at time t plus one, given your probability distribution, 91 00:07:51,534 --> 00:07:58,706 and then the agent is going to receive the reward, as well as the next state, and then we're going to go through this process again, 92 00:07:58,707 --> 00:08:05,542 and keep looping; the agent will select the next action, and so on until the episode is over. 93 00:08:05,542 --> 00:08:06,989 Okay, so 94 00:08:06,989 --> 00:08:10,724 now based on this, we can define a policy pi, 95 00:08:10,724 --> 00:08:16,651 which is a function from your states to your actions that specifies what action to take in each state. 96 00:08:16,651 --> 00:08:19,748 And this can be either deterministic or stochastic. 97 00:08:19,748 --> 00:08:27,204 And our objective now is going to be to find your optimal policy pi star, that maximizes your cumulative discounted reward. 98 00:08:27,205 --> 00:08:35,508 So we can see here we have the sum of our future rewards, which are also discounted by your discount factor. 99 00:08:35,509 --> 00:08:39,327 So, let's look at an example of a simple MDP. 100 00:08:39,327 --> 00:08:44,533 And here we have Grid World, which is this task where we have this grid of states. 101 00:08:44,533 --> 00:08:50,295 So you can be in any of these cells of your grid, which are your states. 102 00:08:50,295 --> 00:08:52,613 And you can take actions from your states, 103 00:08:52,613 --> 00:08:59,298 and so these actions are going to be simple movements, moving to your right, to your left, up or down. 104 00:08:59,299 --> 00:09:08,858 And you're going to get a negative reward for each transition, or each time step, basically, each movement that you take, 105 00:09:08,859 --> 00:09:11,989 and this can be something like R equals negative one. 106 00:09:11,989 --> 00:09:20,054 And so your objective is going to be to reach one of the terminal states, which are the gray states shown here, in the least number of actions. 107 00:09:20,055 --> 00:09:27,624 Right, so the longer that you take to reach your terminal state, the more you're going to keep accumulating these negative rewards. 108 00:09:27,625 --> 00:09:30,540 Okay, so if you look at a random policy here, 109 00:09:30,540 --> 00:09:39,089 a random policy would consist of, basically, at any given state or cell that you're in, just sampling randomly which direction you're going to move in next. 110 00:09:39,090 --> 00:09:41,843 Right, so all of these have equal probability. 111 00:09:41,843 --> 00:09:46,518 On the other hand, an optimal policy that we would like to have is 112 00:09:46,518 --> 00:09:51,866 basically taking the action, the direction, that will move us closest to a terminal state. 113 00:09:51,866 --> 00:09:59,170 So you can see here, if we're right next to one of the terminal states we should always move in the direction that gets us to this terminal state. 114 00:09:59,171 --> 00:10:09,118 And otherwise, if you're in one of these other states, you want to take the direction that will take you closest to one of these states. 115 00:10:09,119 --> 00:10:17,154 Okay, so now given this description of our MDP, what we want to do is we want to find our optimal policy pi star. 116 00:10:17,155 --> 00:10:20,755 Right, our policy that's maximizing the sum of the rewards.
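As a concrete illustration of the Grid World MDP just described, here is a minimal sketch in Python. The grid size, the terminal cells, the starting state, and the reward of negative one per step are assumptions chosen for the example, not the exact setup from the slide:

```python
import random

# Minimal Grid World sketch: states are cells, actions are moves,
# and each step costs -1 until a terminal state is reached.
SIZE = 4                                      # assumed 4x4 grid
TERMINALS = {(0, 0), (SIZE - 1, SIZE - 1)}    # assumed terminal cells
ACTIONS = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}

def step(state, action):
    """Transition: returns (next_state, reward, done)."""
    dr, dc = ACTIONS[action]
    r = min(max(state[0] + dr, 0), SIZE - 1)  # moves off the grid keep you in place
    c = min(max(state[1] + dc, 0), SIZE - 1)
    next_state = (r, c)
    return next_state, -1.0, next_state in TERMINALS

def random_policy(state):
    """The random policy: pick each direction with equal probability."""
    return random.choice(list(ACTIONS))

# Roll out one episode under the random policy and accumulate the reward;
# the longer it wanders before hitting a terminal cell, the lower the return.
state, total_reward, done = (2, 1), 0.0, False
while not done:
    state, reward, done = step(state, random_policy(state))
    total_reward += reward
print("return:", total_reward)
```

An optimal policy would instead pick, in each cell, the move that heads toward the nearest terminal cell, which maximizes this (negative) return.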
117 00:10:20,755 --> 00:10:29,730 And so this optimal policy is going to tell us, given any state that we're in, what is the action that we should take in order to maximize the sum of the rewards that we'll get. 118 00:10:29,731 --> 00:10:34,091 And so one question is how do we handle the randomness in the MDP, right? 119 00:10:34,091 --> 00:10:39,073 We have randomness in terms of our initial state that we're sampling, 120 00:10:39,073 --> 00:10:46,340 in terms of this transition probability distribution that will give us the distribution of our next states, and so on. 121 00:10:46,341 --> 00:10:51,947 And so what we'll do is we'll work, then, with maximizing our expected sum of the rewards. 122 00:10:51,947 --> 00:11:02,956 So, formally, we can write our optimal policy pi star as maximizing this expected sum of future rewards over policies pi, where we have our initial state 123 00:11:02,957 --> 00:11:05,103 sampled from our state distribution. 124 00:11:05,103 --> 00:11:09,127 We have our actions, sampled from our policy, given the state. 125 00:11:09,127 --> 00:11:16,423 And then we have our next states sampled from our transition probability distributions. 126 00:11:16,423 --> 00:11:22,142 Okay, so before we talk about exactly how we're going to find this policy, 127 00:11:22,143 --> 00:11:26,787 let's first talk about a few definitions that are going to be helpful for us in doing so. 128 00:11:26,787 --> 00:11:31,405 So, specifically, the value function and the Q-value function. 129 00:11:31,405 --> 00:11:37,425 So, as we follow the policy, we're going to sample trajectories or paths, right, for every episode. 130 00:11:37,426 --> 00:11:43,611 And we're going to have our initial state s-zero, a-zero, r-zero, s-one, a-one, r-one, and so on. 131 00:11:43,611 --> 00:11:49,331 We're going to have this trajectory of states, actions, and rewards that we get. 132 00:11:49,331 --> 00:11:52,613 And so, how good is a state that we're currently in? 133 00:11:52,613 --> 00:12:10,799 Well, the value function at any state s is the expected cumulative reward following the policy from state s, from here on out. Right, so it's going to be the expected value of our cumulative reward, starting from our current state. 134 00:12:10,800 --> 00:12:13,286 And then how good is a state, action pair? 135 00:12:13,286 --> 00:12:17,370 So how good is taking action a in state s? 136 00:12:17,370 --> 00:12:20,468 And we define this using a Q-value function, 137 00:12:20,468 --> 00:12:27,741 which is the expected cumulative reward from taking action a in state s and then following the policy. 138 00:12:29,708 --> 00:12:45,098 Right, so then, the optimal Q-value function that we can get is going to be Q star, which is the maximum expected cumulative reward that we can get from a given state, action pair, defined here. 139 00:12:45,099 --> 00:12:52,017 So now we're going to see one important thing in reinforcement learning, which is called the Bellman equation. 140 00:12:52,018 --> 00:12:57,697 So let's consider the Q-value function from the optimal policy, Q star, 141 00:12:57,697 --> 00:13:00,911 which is then going to satisfy this Bellman equation, 142 00:13:00,911 --> 00:13:05,194 which is this identity shown here, and what this means is that 143 00:13:05,194 --> 00:13:11,748 given any state, action pair, s and a, the value of this pair is going to be the reward 144 00:13:11,748 --> 00:13:18,867 that you're going to get, r, plus the value of whatever state that you end up in. So, let's say, s prime.
145 00:13:18,868 --> 00:13:24,150 And since we know that we have the optimal policy, then we also know that we're going to 146 00:13:24,150 --> 00:13:28,746 play the best action that we can, right, at our state s prime. 147 00:13:28,746 --> 00:13:34,432 And so then, the value at state s prime is just going to be the maximum over our actions, 148 00:13:34,432 --> 00:13:38,626 a prime, of Q star at s prime, a prime. 149 00:13:38,626 --> 00:13:44,119 And so then we get this identity here, for the optimal Q-value. 150 00:13:44,119 --> 00:13:54,251 Right, and then also, as always, we have this expectation here, because we have randomness over what state we're going to end up in. 151 00:13:54,252 --> 00:14:02,487 And then we can also infer, from here, that our optimal policy, right, is going to consist of taking the best action in any state, as specified by Q star. 152 00:14:02,488 --> 00:14:08,436 Q star is going to tell us the maximum future reward that we can get from any of our actions, 153 00:14:08,437 --> 00:14:16,862 so we should just take a policy that's following this and just taking the action that's going to lead to the best reward. 154 00:14:16,863 --> 00:14:21,025 Okay, so how can we solve for this optimal policy? 155 00:14:21,025 --> 00:14:25,692 So, one way we can solve for this is something called a value iteration algorithm, 156 00:14:25,692 --> 00:14:29,527 where we're going to use this Bellman equation as an iterative update. 157 00:14:29,527 --> 00:14:37,997 So at each step, we're going to refine our approximation of Q star by trying to enforce the Bellman equation. 158 00:14:39,347 --> 00:14:50,830 And so, under some mathematical conditions, we also know that this sequence Q_i of our Q-functions is going to converge to our optimal Q star as i approaches infinity. 159 00:14:54,257 --> 00:14:58,329 And so this, this works well, but what's the problem with this? 160 00:14:59,184 --> 00:15:01,887 Well, an important problem is that this is not scalable. 161 00:15:01,887 --> 00:15:02,720 Right? 162 00:15:02,720 --> 00:15:08,596 We have to compute Q of s, a here for every state, action pair in order to make our iterative updates. 163 00:15:08,597 --> 00:15:18,932 Right, but then this is a problem if, for example, we look at the state of, for example, an Atari game that we had earlier; it's going to be your screen of pixels. 164 00:15:18,933 --> 00:15:28,724 And this is a huge state space, and it's basically computationally infeasible to compute this for the entire state space. 165 00:15:28,725 --> 00:15:35,907 Okay, so what's the solution to this? Well, we can use a function approximator to estimate Q of s, a, 166 00:15:35,908 --> 00:15:37,620 so, for example, a neural network, right. 167 00:15:37,620 --> 00:15:48,471 So, we've seen before that any time we have some really complex function that we don't know, that we want to estimate, a neural network is a good way to estimate this. 168 00:15:48,472 --> 00:15:54,242 Okay, so this is going to take us to our formulation of Q-learning that we're going to look at. 169 00:15:54,242 --> 00:16:02,950 And so, what we're going to do is we're going to use a function approximator in order to estimate our action-value function. 170 00:16:02,118 --> 00:16:02,951 Right? 171 00:16:02,951 --> 00:16:08,141 And if this function approximator is a deep neural network, which is what's been used recently, 172 00:16:08,142 --> 00:16:10,782 then this is going to be called deep Q-learning.
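To make this concrete, here is a rough sketch of a Q-learning update with a neural-network function approximator, written in PyTorch. The tiny fully-connected network, the Adam optimizer, and the hyperparameter values are illustrative assumptions; the actual Atari network described next is convolutional:

```python
import torch
import torch.nn as nn

gamma = 0.99          # assumed discount factor
num_actions = 4       # e.g. four game actions
state_dim = 16        # assumed flattened state size for this toy sketch

# Q(s, a; theta): one forward pass outputs a Q-value for every action.
q_net = nn.Sequential(nn.Linear(state_dim, 64), nn.ReLU(), nn.Linear(64, num_actions))
optimizer = torch.optim.Adam(q_net.parameters(), lr=1e-3)

def q_learning_step(states, actions, rewards, next_states, dones):
    """One gradient step on the Bellman error:
       y_i = r + gamma * max_a' Q(s', a'; theta), loss = (Q(s, a; theta) - y_i)^2."""
    q_sa = q_net(states).gather(1, actions.unsqueeze(1)).squeeze(1)
    with torch.no_grad():   # treat the target y_i as fixed for this step
        y = rewards + gamma * (1 - dones) * q_net(next_states).max(dim=1).values
    loss = nn.functional.mse_loss(q_sa, y)   # forward pass: Bellman error
    optimizer.zero_grad()
    loss.backward()                          # backward pass: d(loss)/d(theta)
    optimizer.step()                         # gradient step on theta
    return loss.item()
```

Iterating this update is what pushes Q(s, a; theta) toward satisfying the Bellman equation.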
173 00:16:10,782 --> 00:16:20,149 And so this is something that you'll hear around as one of the common approaches to deep reinforcement learning that's in use. 174 00:16:20,150 --> 00:16:30,765 Right, and so in this case, we also have our function parameters theta here, so our Q-value function is determined by these weights, theta, of our neural network. 175 00:16:33,050 --> 00:16:37,970 Okay, so given this function approximation, how do we solve for our optimal policy? 176 00:16:37,970 --> 00:16:44,744 So remember that we want to find a Q-function that's satisfying the Bellman equation. 177 00:16:44,744 --> 00:16:54,712 Right, and so we want to enforce this Bellman equation to happen, so what we can do when we have this neural network approximating our Q-function is that 178 00:16:54,713 --> 00:17:00,239 we can train this where our loss function is going to try and minimize the error of our Bellman equation, right? 179 00:17:00,240 --> 00:17:03,689 Or how far Q of s, a is from its target, 180 00:17:03,689 --> 00:17:09,853 which is the Y_i here, the right-hand side of the Bellman equation that we saw earlier. 181 00:17:09,853 --> 00:17:16,928 So, we're basically going to take these forward passes of our loss function, trying to minimize this error, 182 00:17:16,929 --> 00:17:28,182 and then our backward pass, our gradient update, is just going to be, you just take the gradient of this loss with respect to our network parameters theta. 183 00:17:28,183 --> 00:17:38,435 Right, and so our goal is again to have this effect as we're taking gradient steps of iteratively trying to make our Q-function closer to our target value. 184 00:17:38,436 --> 00:17:43,524 So, any questions about this? Okay. 185 00:17:44,537 --> 00:17:53,369 So let's look at a case study, one of the classic examples of deep reinforcement learning where this approach was applied. 186 00:17:53,370 --> 00:18:01,745 And so we're going to look at this problem that we saw earlier of playing Atari games, where our objective was to complete the game with the highest score, 187 00:18:01,746 --> 00:18:05,460 and remember our state is going to be the raw pixel inputs of the game state, 188 00:18:05,460 --> 00:18:12,963 and we can take these actions of moving left, right, up, down, or whatever actions of the particular game. 189 00:18:12,964 --> 00:18:21,182 And our reward at each time step, we're going to get a reward of our score increase or decrease that we got at this time step, and so our cumulative total 190 00:18:21,183 --> 00:18:27,095 reward is this total reward that we'll usually see at the top of the screen. 191 00:18:27,095 --> 00:18:32,955 Okay, so the network that we're going to use for our Q-function is going to look something like this, 192 00:18:32,955 --> 00:18:37,355 right, where we have our Q-network, with weights theta. 193 00:18:37,355 --> 00:18:43,791 And then our input, our state s, is going to be our current game screen. 194 00:18:43,791 --> 00:18:49,509 And in practice we're going to take a stack of the last four frames, so we have some history. 195 00:18:49,509 --> 00:18:58,608 And so we'll take these raw pixel values, we'll do some, you know, RGB to gray-scale conversions, some down-sampling, some cropping, so, some pre-processing. 196 00:18:58,609 --> 00:19:04,631 And what we'll get out of this is this 84 by 84 by four stack of the last four frames. 197 00:19:04,631 --> 00:19:05,464 Yeah, question.
198 00:19:05,464 --> 00:19:09,631 [inaudible question from audience] 199 00:19:12,792 --> 00:19:20,808 Okay, so the question is, are we saying here that our network is going to approximate our Q-value function for different state, action pairs, 200 00:19:20,809 --> 00:19:24,765 for example, four of these? Yeah, that's correct. 201 00:19:24,765 --> 00:19:27,551 We'll see, we'll talk about that in a few slides. 202 00:19:27,551 --> 00:19:29,935 [inaudible question from audience] 203 00:19:29,935 --> 00:19:36,815 So, no. So, we don't have a Softmax layer after the fully-connected layer, because here our goal is to directly predict our Q-values. 204 00:19:36,816 --> 00:19:40,582 [inaudible question from audience] Q-values. [inaudible question from audience] 205 00:19:40,583 --> 00:19:44,014 Yes, so it's more doing regression to our Q-values. 206 00:19:44,014 --> 00:19:54,083 Okay, so we have our input to this network and then on top of this, we're going to have a couple of familiar convolutional layers, and a fully-connected layer, 207 00:19:54,084 --> 00:19:59,610 so here we have some eight-by-eight convolutions and we have some four-by-four convolutions. 208 00:19:59,611 --> 00:20:05,673 Then we have an FC-256 layer, so this is just a standard kind of network that you've seen before. 209 00:20:05,674 --> 00:20:17,414 And then, finally, our last fully-connected layer has a vector of outputs, which is corresponding to your Q-value for each action, right, given the state that you've input. 210 00:20:17,415 --> 00:20:21,770 And so, for example, if you have four actions, then here we have this four-dimensional output 211 00:20:21,770 --> 00:20:28,685 corresponding to Q of the current s, as well as a-one, and then a-two, a-three, and a-four. 212 00:20:28,685 --> 00:20:33,179 Right, so this is going to be one scalar value for each of our actions. 213 00:20:33,179 --> 00:20:43,072 And then the number of actions that we have can vary from, for example, 4 to 18, depending on the Atari game. 214 00:20:43,073 --> 00:20:46,709 And one nice thing here is that using this network structure, 215 00:20:46,709 --> 00:20:54,650 a single feedforward pass is able to compute the Q-values for all actions from the current state. 216 00:20:54,651 --> 00:20:56,117 And so this is really efficient. 217 00:20:56,117 --> 00:21:05,945 Right, so basically we take our current state in, and then because we have this output of a Q-value for each action as our output layer, 218 00:21:05,946 --> 00:21:10,259 we're able to do one pass and get all of these values out. 219 00:21:10,259 --> 00:21:15,078 And then in order to train this, we're just going to use our loss function from before. 220 00:21:15,078 --> 00:21:17,661 Remember, we're trying to enforce this Bellman equation, 221 00:21:17,661 --> 00:21:29,314 and so, on our forward pass, with our loss function we're going to try and iteratively make our Q-value close to the target value that it should have. 222 00:21:29,315 --> 00:21:40,705 And then our backward pass is just directly taking the gradient of this loss function that we have and then taking a gradient step based on that. 223 00:21:40,706 --> 00:21:45,639 So one other thing that's used here that I want to mention is something called experience replay. 224 00:21:45,639 --> 00:21:53,440 And so this addresses a problem with just using the plain Q-network that I just described, 225 00:21:53,440 --> 00:21:58,134 which is that learning from batches of consecutive samples is bad.
226 00:21:58,134 --> 00:22:09,409 And so the reason for this, right, is if we're just playing the game, taking samples of state, action, rewards that we have and just taking consecutive samples of these and training with these, 227 00:22:09,410 --> 00:22:16,117 well, all of these samples are correlated and so this leads to inefficient learning, first of all, 228 00:22:16,118 --> 00:22:24,841 and also, because of this, our current Q-network parameters, right, this determines the policy that we're going to follow, it determines our next 229 00:22:24,842 --> 00:22:27,394 samples that we're going to get that we're going to use for training. 230 00:22:27,394 --> 00:22:30,832 And so this leads to problems where you can have bad feedback loops. 231 00:22:30,832 --> 00:22:35,468 So, for example, if currently the maximizing action is to move left, 232 00:22:35,468 --> 00:22:43,305 well, this is going to bias all of my upcoming training examples to be dominated by samples from the left-hand side. 233 00:22:43,306 --> 00:22:45,406 And so this is a problem, right? 234 00:22:45,406 --> 00:22:53,097 And so the way that we are going to address these problems is by using something called experience replay, where we're going to keep this 235 00:22:53,098 --> 00:23:01,352 replay memory table of transitions, of state, action, reward, next state transitions that we have, and we're going 236 00:23:01,353 --> 00:23:08,772 to continuously update this table with new transitions that we're getting as game episodes are played, as we're getting more experience. 237 00:23:08,773 --> 00:23:16,334 Right, and so now what we can do is that we can now train our Q-network on random mini-batches of transitions from the replay memory. 238 00:23:16,335 --> 00:23:24,826 Right, so instead of using consecutive samples, we're now going to sample across these transitions that we've accumulated, taking random samples of these, 239 00:23:24,827 --> 00:23:31,007 and this breaks all of these correlation problems that we had earlier. 240 00:23:31,007 --> 00:23:39,206 And then also, another side benefit is that each of these transitions can also contribute to potentially multiple weight updates. 241 00:23:39,207 --> 00:23:43,652 We're just sampling from this table and so we could sample one multiple times. 242 00:23:43,652 --> 00:23:47,585 And so, this is going to lead also to greater data efficiency. 243 00:23:50,580 --> 00:23:59,165 Okay, so let's put this all together and let's look at the full algorithm for deep Q-learning with experience replay. 244 00:23:59,166 --> 00:24:03,940 So we're going to start off by initializing our replay memory 245 00:24:03,940 --> 00:24:14,829 to some capacity that we choose, N, and then we're also going to initialize our Q-network, just with our random weights or initial weights. 246 00:24:14,830 --> 00:24:21,832 And then we're going to play M episodes, or full games. These are going to be our training episodes. 247 00:24:21,832 --> 00:24:31,264 And then what we're going to do is we're going to initialize our state, using the starting game screen pixels at the beginning of each episode. 248 00:24:31,265 --> 00:24:37,814 And remember, we go through the pre-processing step to get to our actual input state.
249 00:24:37,814 --> 00:24:41,584 And then for each time step of a game that we're currently playing, 250 00:24:41,584 --> 00:24:46,268 we're going to, with a small probability, select a random action, 251 00:24:46,268 --> 00:24:53,141 so one thing that's important in these algorithms is to have sufficient exploration, 252 00:24:53,141 --> 00:24:58,559 so we want to make sure that we are sampling different parts of the state space. 253 00:24:58,559 --> 00:25:03,613 And then otherwise, we're going to select the greedy action from the current policy. 254 00:25:03,614 --> 00:25:13,579 Right, so most of the time we'll take the greedy action that we think is a good policy of the type of actions that we want to take and states that we want to see, and with a small probability 255 00:25:13,580 --> 00:25:16,300 we'll sample something random. 256 00:25:16,300 --> 00:25:26,069 Okay, so then we'll take this action, a_t, and we'll observe the next reward and the next state. So r_t and s_t plus one. 257 00:25:26,070 --> 00:25:32,771 And then we'll take this and we'll store this transition in our replay memory that we're building up. 258 00:25:32,771 --> 00:25:35,577 And then we're going to train the network a little bit. 259 00:25:35,577 --> 00:25:37,550 So we're going to do experience replay, 260 00:25:37,550 --> 00:25:47,213 and we'll take a sample of random mini-batches of transitions that we have from the replay memory, and then we'll perform a gradient descent step on this. 261 00:25:47,214 --> 00:25:49,635 Right, so this is going to be our full training loop. 262 00:25:49,635 --> 00:26:03,886 We're going to be continuously playing this game and then also sampling minibatches, using experience replay to update the weights of our Q-network, and then continuing in this fashion. 263 00:26:03,887 --> 00:26:20,910 Okay, so let's see. Let's see if I can, is this playing? Okay, so let's take a look at this deep Q-learning algorithm from Google DeepMind, trained on an Atari game of Breakout. 264 00:26:20,911 --> 00:26:26,185 Alright, so it's saying here that our input is just going to be our state, the raw game pixels. 265 00:26:26,185 --> 00:26:29,520 And so here we're looking at what's happening at the beginning of training. 266 00:26:29,520 --> 00:26:40,302 So we've just started training a bit. And right, so it looks like it's learned to kind of hit the ball, but it's not doing a very good job of sustaining it. 267 00:26:40,303 --> 00:26:42,886 But it is looking for the ball. 268 00:26:50,969 --> 00:26:55,737 Okay, so now after some more training, it looks like a couple hours. 269 00:27:00,946 --> 00:27:05,113 Okay, so now it's learning to do a pretty good job here. 270 00:27:06,190 --> 00:27:16,592 So it's able to continuously follow this ball and be able to remove most of the blocks. 271 00:27:16,593 --> 00:27:36,203 Right, so after 240 minutes. Okay, so here it's found the pro strategy, right? 272 00:27:36,203 --> 00:27:42,795 You want to get all the way to the top and then have it go by itself. Okay, so 273 00:27:42,796 --> 00:27:49,500 this is an example of using deep Q-learning in order to train an agent to be able to play Atari games. 274 00:27:49,501 --> 00:27:56,418 It's able to do this on many Atari games and so you can check out some more of this online. 275 00:27:56,419 --> 00:28:01,149 Okay, so we've talked about Q-learning. But there is a problem with Q-learning, right? 276 00:28:01,149 --> 00:28:07,225 It can be challenging, and what's the problem?
Well, the problem can be that the Q-function is very complicated. 277 00:28:07,226 --> 00:28:12,335 Right, so we're saying that we want to learn the value of every state, action pair. 278 00:28:12,335 --> 00:28:17,275 So, if, let's say you have something, for example, a robot grasping, wanting to grasp an object. 279 00:28:17,275 --> 00:28:19,576 Right, you're going to have a really high dimensional state. 280 00:28:19,576 --> 00:28:26,225 You have, I mean, let's say you have all of your joint positions and angles. 281 00:28:26,225 --> 00:28:35,492 Right, and so learning the exact value of every state, action pair that you have, right, can be really, really hard to do. 282 00:28:35,493 --> 00:28:38,724 But on the other hand, your policy can be much simpler. 283 00:28:38,724 --> 00:28:44,555 Right, like what you want this robot to do may be just this simple motion of closing your hand, right? 284 00:28:44,556 --> 00:28:48,252 Just, move your fingers in this particular direction and keep going. 285 00:28:48,252 --> 00:28:54,142 And so, that leads to the question of can we just learn this policy directly? 286 00:28:54,142 --> 00:28:58,306 Right, is it possible, maybe, to just find the best policy from a collection of policies, 287 00:28:58,306 --> 00:29:06,789 without trying to go through this process of estimating your Q-value and then using that to infer your policy? 288 00:29:06,790 --> 00:29:15,937 So, okay, this is an approach that we're going to call policy gradients. 289 00:29:15,938 --> 00:29:20,858 And so, formally, let's define a class of parametrized policies, 290 00:29:20,858 --> 00:29:24,146 parametrized by weights theta, 291 00:29:24,146 --> 00:29:27,791 and so for each policy let's define the value of the policy. 292 00:29:27,791 --> 00:29:35,722 So, J, our value J, given parameters theta, is going to be our expected cumulative sum of future rewards that we care about. 293 00:29:35,723 --> 00:29:38,971 So, the same reward that we've been using. 294 00:29:38,971 --> 00:29:41,879 And so our goal then, under this setup, 295 00:29:41,879 --> 00:29:51,547 is that we want to find an optimal policy, theta star, which is the arg max over theta of J of theta. 296 00:29:51,548 --> 00:29:56,917 So we want to find the policy, the policy parameters, that give our best expected reward. 297 00:29:56,917 --> 00:30:01,011 So, how can we do this? Any ideas? 298 00:30:04,993 --> 00:30:10,155 Okay, well, what we can do is just gradient ascent on our policy parameters, right? 299 00:30:10,155 --> 00:30:15,460 We've learned that given some objective that we have and some parameters, we can just use gradient ascent 300 00:30:15,460 --> 00:30:20,762 in order to continuously improve our parameters. 301 00:30:23,202 --> 00:30:29,196 And so let's talk more specifically about how we can do this, which we're going to call here the REINFORCE algorithm. 302 00:30:29,196 --> 00:30:36,781 So, mathematically, we can write out our expected future reward over trajectories, and so we're going to sample 303 00:30:36,781 --> 00:30:41,902 these trajectories of experience, right, like for example episodes of game play that we talked about earlier. 304 00:30:41,902 --> 00:30:47,411 S-zero, a-zero, r-zero, s-one, a-one, r-one, and so on. 305 00:30:47,411 --> 00:30:51,723 Using some policy pi of theta.
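Written out, the policy value and the objective just described are:

$$ J(\theta) = \mathbb{E}\left[\sum_{t \ge 0} \gamma^t r_t \mid \pi_\theta\right], \qquad \theta^* = \arg\max_\theta J(\theta), $$

and REINFORCE will estimate the gradient of this expectation from trajectories sampled under $\pi_\theta$.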
306 00:30:51,723 --> 00:30:57,739 Right, and then so, for each trajectory we can compute a reward for that trajectory. 307 00:30:57,739 --> 00:31:01,245 It's the cumulative reward that we got from following this trajectory. 308 00:31:01,245 --> 00:31:10,570 And then the value of a policy, pi sub theta, is going to be just the expected reward of these trajectories that we can get from following pi sub theta. 309 00:31:10,570 --> 00:31:16,868 So that's here, this expectation over trajectories that we can get, sampling trajectories from our policy. 310 00:31:18,563 --> 00:31:21,288 Okay. So, we want to do gradient ascent, right? 311 00:31:21,288 --> 00:31:22,961 So let's differentiate this. 312 00:31:22,961 --> 00:31:27,356 Once we differentiate this, then we can just take gradient steps, like normal. 313 00:31:28,535 --> 00:31:34,300 So, the problem is that now if we try and just differentiate this exactly, this is intractable, right? 314 00:31:34,300 --> 00:31:41,319 So, the gradient of an expectation is problematic when p is dependent on theta here, because here 315 00:31:41,319 --> 00:31:47,661 we want to take this gradient of p of tau, given theta, 316 00:31:47,661 --> 00:31:53,033 but we want to take this integral over tau. Right, so this is intractable. 317 00:31:53,033 --> 00:31:57,327 However, we can use a trick here to get around this. 318 00:31:57,327 --> 00:32:04,941 And this trick is taking this gradient that we want, of p. We can rewrite this by just multiplying this by one, 319 00:32:04,941 --> 00:32:10,286 by multiplying top and bottom, both by p of tau given theta. 320 00:32:10,286 --> 00:32:26,170 Right, and then if we look at these terms that we have now here, in the way that I've written this, on the left and the right, this is actually going to be equivalent to p of tau times our gradient with respect to theta of log of p. 321 00:32:26,170 --> 00:32:32,741 Right, because the gradient of the log of p is just going to be one over p times the gradient of p. 322 00:32:33,808 --> 00:32:41,385 Okay, so if we then inject this back into our expression that we had earlier for this gradient, 323 00:32:41,385 --> 00:32:43,426 we can see what this will actually look like, 324 00:32:43,426 --> 00:32:52,187 right, because now we have a gradient of log p times our probabilities of all of these trajectories and then taking this integral here, over tau. 325 00:32:52,187 --> 00:32:58,586 This is now going to be an expectation over our trajectories tau, and so what we've done here 326 00:32:58,586 --> 00:33:02,751 is that we've taken a gradient of an expectation 327 00:33:02,751 --> 00:33:06,823 and we've transformed it into an expectation of gradients. 328 00:33:06,823 --> 00:33:13,817 Right, and so now we can use sample trajectories that we can get in order to estimate our gradient. 329 00:33:14,712 --> 00:33:21,260 And so we do this using Monte Carlo sampling, and this is one of the core ideas of REINFORCE. 330 00:33:23,624 --> 00:33:28,180 Okay, so looking at this expression that we want to compute, 331 00:33:28,180 --> 00:33:33,071 can we compute these quantities that we had here without knowing the transition probabilities? 332 00:33:33,071 --> 00:33:38,466 Alright, so we have that p of tau is going to be the probability of a trajectory.
333 00:33:38,466 --> 00:33:45,821 It's going to be the product of all of our transition probabilities of the next state that we get, given our current state and action, 334 00:33:45,821 --> 00:33:52,232 as well as our probability of the actions that we've taken under our policy pi. 335 00:33:52,232 --> 00:33:58,441 Right, so we're going to multiply all of these together, and get our probability of our trajectory. 336 00:33:58,441 --> 00:34:10,389 So this log of p of tau that we want to compute is going to be, we just take this log and this will separate out into a sum, pushing the logs inside. 337 00:34:10,389 --> 00:34:18,161 And then here, when we differentiate this, we can see we want to differentiate with respect to theta, but this first term that we have here, 338 00:34:18,163 --> 00:34:22,850 log p of the state transition probabilities, there's no theta term here, and so 339 00:34:22,850 --> 00:34:28,709 the only place where we have theta is the second term that we have, of log of pi sub theta, 340 00:34:29,675 --> 00:34:39,669 of our action, given our state, and so this is the only term that we keep in our gradient estimate, and so we can see here that this doesn't depend on the transition probabilities, 341 00:34:39,670 --> 00:34:46,421 right, so we actually don't need to know our transition probabilities in order to compute our gradient estimate. 342 00:34:47,257 --> 00:34:57,264 And then, so, therefore when we're sampling these, for any given trajectory tau, we can estimate J of theta using this gradient estimate. 343 00:34:58,524 --> 00:35:02,220 This is here shown for a single trajectory from what we had earlier, 344 00:35:02,220 --> 00:35:07,188 and then we can also sample over multiple trajectories to get the expectation. 345 00:35:09,248 --> 00:35:17,141 Okay, so given this gradient estimator that we've derived, the interpretation that we can make from this here, is that 346 00:35:18,217 --> 00:35:25,226 if our reward for a trajectory is high, if the reward that we got from taking the sequence of actions was good, 347 00:35:25,226 --> 00:35:29,434 then let's push up the probabilities of all the actions that we've seen. 348 00:35:29,434 --> 00:35:33,141 Right, we're just going to say that these were good actions that we took. 349 00:35:33,141 --> 00:35:37,186 And then if the reward is low, we want to push down these probabilities. 350 00:35:37,186 --> 00:35:40,747 We want to say these were bad actions, let's try and not sample these so much. 351 00:35:40,747 --> 00:35:50,980 Right, and so we can see that's what's happening here, where we have pi of a, given s. This is the likelihood of the actions that we've taken, 352 00:35:50,980 --> 00:36:03,575 and then we're going to scale this, we're going to take the gradient, and the gradient is going to tell us how much we should change the parameters in order to increase our likelihood of our action, a, right? 353 00:36:03,575 --> 00:36:12,602 And then we're going to take this and scale it by how much reward we actually got from it, so how good were these actions, in reality. 354 00:36:14,561 --> 00:36:23,798 Okay, so this might seem simplistic to say that, you know, if a trajectory is good, then we're saying here that all of its actions were good. Right? 355 00:36:23,798 --> 00:36:26,356 But, in expectation, this actually averages out.
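As a minimal sketch of turning this estimator into an actual gradient step (the small policy network, the optimizer, and the tensor shapes are assumptions for illustration), the core of REINFORCE for one sampled trajectory looks like this in PyTorch:

```python
import torch
import torch.nn as nn

# A stand-in policy network pi_theta(a | s): state in, action logits out.
policy = nn.Sequential(nn.Linear(4, 32), nn.ReLU(), nn.Linear(32, 2))
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-3)

def reinforce_step(states, actions, rewards):
    """One update from a single sampled trajectory:
       grad J(theta) ~ sum_t r(tau) * grad_theta log pi_theta(a_t | s_t)."""
    log_probs = torch.log_softmax(policy(states), dim=1)
    log_pi = log_probs.gather(1, actions.unsqueeze(1)).squeeze(1)  # log pi(a_t | s_t)
    r_tau = rewards.sum()                    # total reward of the trajectory
    loss = -(r_tau * log_pi).sum()           # minimizing this does gradient ascent on J
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```

Note that every action in the trajectory gets scaled by the same total reward r(tau); the variance-reduction ideas discussed next (reward-to-go, discounting, baselines) only change what multiplies each log-probability term.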
356 00:36:26,356 --> 00:36:30,125 So we have an unbiased estimator here, 357 00:36:30,125 --> 00:36:35,622 and so if you have many samples of this, then we will get an accurate estimate of our gradient. 358 00:36:35,622 --> 00:36:42,994 And this is nice because we can just take gradient steps and we know that we're going to be improving our objective and getting closer 359 00:36:42,994 --> 00:36:48,602 to, at least, some local optimum of our policy parameters theta. 360 00:36:48,602 --> 00:36:54,884 Alright, but there is a problem with this, and the problem is that this also suffers from high variance, 361 00:36:54,884 --> 00:36:57,201 because this credit assignment is really hard. 362 00:36:57,201 --> 00:37:04,412 Right, we're saying that given a reward that we got, we're going to say all of the actions were good, and we're just going to hope 363 00:37:04,412 --> 00:37:11,080 that this assignment of which actions were actually the best actions, that mattered, is going to average out over time. 364 00:37:11,080 --> 00:37:17,190 And so this is really hard and we need a lot of samples in order to have a good estimate. 365 00:37:17,190 --> 00:37:23,851 Alright, so this leads to the question of, is there anything that we can do to reduce the variance and improve the estimator? 366 00:37:26,540 --> 00:37:33,323 And so variance reduction is an important area of research in policy gradients, 367 00:37:33,323 --> 00:37:39,756 and in coming up with ways in order to improve the estimator and require fewer samples. 368 00:37:39,756 --> 00:37:43,278 Alright, so let's look at a couple of ideas of how we can do this. 369 00:37:44,202 --> 00:37:46,764 So given our gradient estimator, 370 00:37:46,764 --> 00:37:57,091 so the first idea is that we can push up the probabilities of an action only by its effect on future rewards from that state, right? 371 00:37:57,091 --> 00:38:04,736 So now, instead of scaling this likelihood, or pushing up this likelihood of this action, by the total reward of its trajectory, 372 00:38:04,736 --> 00:38:12,108 let's look more specifically at just the sum of rewards coming from this time step on to the end, right? 373 00:38:12,108 --> 00:38:18,999 And so, this is basically saying that how good an action is, is only specified by how much future reward it generates. 374 00:38:18,999 --> 00:38:20,499 Which makes sense. 375 00:38:21,811 --> 00:38:29,448 Okay, so a second idea that we can also use is using a discount factor in order to ignore delayed effects. 376 00:38:29,448 --> 00:38:36,774 Alright, so here we've added back in this discount factor, that we've seen before, which is saying that 377 00:38:36,774 --> 00:38:47,276 you know, our discount factor is going to tell us how much we care about just the rewards that are coming up soon, versus rewards that came much later on. 378 00:38:47,276 --> 00:38:57,489 Right, so we're going to now say how good or bad an action is, looking more at the local neighborhood of actions it generates in the immediate near future, 379 00:38:57,489 --> 00:39:00,880 and down-weighting the ones that come later on. 380 00:39:00,880 --> 00:39:07,730 Okay, so these are some straightforward ideas that are generally used in practice. 381 00:39:07,730 --> 00:39:14,597 So, a third idea is this idea of using a baseline in order to reduce your variance. 382 00:39:14,597 --> 00:39:20,690 And so, a problem with just using the raw value of your trajectories is that 383 00:39:21,675 --> 00:39:23,869 this isn't necessarily meaningful, right?
384 00:39:23,869 --> 00:39:29,835 So, for example, if your rewards are all positive, then you're just going to keep pushing up the probabilities of all your actions. 385 00:39:29,835 --> 00:39:32,039 And of course, you'll push them up to various degrees, 386 00:39:32,039 --> 00:39:39,753 but what's really important is whether a reward is better or worse than what you're expecting to be getting. 387 00:39:39,753 --> 00:39:46,071 Alright, so in order to address this, we can introduce a baseline function that's dependent on the state. 388 00:39:46,071 --> 00:39:53,886 Right, so this baseline function tells us our guess of what we expect to get from this state, and then 389 00:39:55,515 --> 00:39:59,837 our reward, or our scaling factor that we're going to use to be pushing up or down our probabilities, 390 00:39:59,837 --> 00:40:05,508 can now just be our expected sum of future rewards minus this baseline, so now it's relative: how 391 00:40:05,508 --> 00:40:10,772 much better or worse is the reward that we got than what we expected. 392 00:40:11,870 --> 00:40:16,099 And so how can we choose this baseline? Well, 393 00:40:16,099 --> 00:40:19,168 a very simple baseline, the most simple you can use, 394 00:40:19,168 --> 00:40:23,013 is just taking a moving average of rewards that you've experienced so far. 395 00:40:23,013 --> 00:40:34,765 So you can even do this over all trajectories, and this is just an average of what rewards have I been seeing as I've been training, and as I've been playing these episodes? 396 00:40:34,765 --> 00:40:41,716 Right, and so this gives some idea of whether the reward that I currently get was relatively better or worse. 397 00:40:42,821 --> 00:40:54,452 And so there are some variants of this that you can use, but the variance reduction techniques that we've seen so far are all used in what's typically called the "vanilla REINFORCE" algorithm. 398 00:40:54,452 --> 00:41:00,954 Right, so looking at the cumulative future reward, having a discount factor, and some simple baselines. 399 00:41:02,601 --> 00:41:08,769 Now let's talk about how we can think about this idea of a baseline and potentially choose better baselines. 400 00:41:08,769 --> 00:41:13,567 Right, so if we're going to think about what's a better baseline that we can choose, 401 00:41:13,567 --> 00:41:24,255 what we want to do is we want to push up the probability of an action from a state, if the action was better than the expected value of what we should get from that state. 402 00:41:24,255 --> 00:41:30,163 So, thinking about the value of what we're going to expect from the state, what does this remind you of? 403 00:41:30,163 --> 00:41:34,939 Does this remind you of anything that we talked about earlier in this lecture? 404 00:41:37,023 --> 00:41:41,297 Yes. [inaudible from audience] Yeah, so the value functions, right? 405 00:41:41,297 --> 00:41:47,871 The value functions that we talked about with Q-learning. So, exactly. So Q-functions and value functions, 406 00:41:47,871 --> 00:41:58,173 and so, the intuition is that, well, we're happy with an action, taking an action in a state s, if 407 00:41:58,173 --> 00:42:04,752 our Q-value of taking a specific action from this state is larger than 408 00:42:04,752 --> 00:42:09,698 the value function, or expected value of the cumulative future reward, that we can get from this state. 409 00:42:09,698 --> 00:42:14,416 Right, so this means that this action was better than other actions that we could've taken.
410 00:42:14,416 --> 00:42:22,063 And on the contrary, we're unhappy if this action, if this value or this difference, is negative or small. 411 00:42:23,917 --> 00:42:34,868 Right, so now if we plug this in as our scaling factor of how much we want to push up or down our probabilities of our actions, then we can get this estimator here. 412 00:42:34,868 --> 00:42:40,168 Right, so, it's going to be exactly the same as before, but now where 413 00:42:40,168 --> 00:42:50,514 before we had our cumulative expected reward, with our variance reduction techniques and baselines in, here we can just plug in 414 00:42:50,514 --> 00:43:00,530 this difference of how much better our current action was, based on our Q-function minus our value function from that state. 415 00:43:01,771 --> 00:43:09,413 Right, but with what we've talked about so far with our REINFORCE algorithm, we don't know what Q and V actually are. 416 00:43:09,413 --> 00:43:14,479 So can we learn these? And the answer is yes, using Q-learning, 417 00:43:14,479 --> 00:43:16,465 what we've already talked about before. 418 00:43:16,465 --> 00:43:22,210 So we can combine policy gradients, which we've just been talking about, with Q-learning, 419 00:43:22,210 --> 00:43:28,784 by training both an actor, which is the policy, as well as a critic, right, a Q-function, 420 00:43:28,784 --> 00:43:34,380 which is going to tell us how good we think a state is, and an action in a state. 421 00:43:34,380 --> 00:43:40,633 Right, so using this approach, an actor is going to decide which action to take, 422 00:43:40,633 --> 00:43:47,708 and then the critic, or Q-function, is going to tell the actor how good its action was and how it should adjust. 423 00:43:47,708 --> 00:43:53,636 And so, this also alleviates a little bit the task of this critic, compared to the Q-learning problems 424 00:43:53,636 --> 00:43:59,958 that we talked about earlier, of having to learn a Q-value for every state, action pair, 425 00:43:59,958 --> 00:44:04,762 because here it only has to learn this for the state-action pairs that are generated by the policy. 426 00:44:04,762 --> 00:44:10,512 It only needs to know this where it matters for computing this scaling factor. 427 00:44:10,512 --> 00:44:18,188 Right, and then we can also, as we're learning this, incorporate all of the Q-learning tricks that we saw earlier, such as experience replay. 428 00:44:18,188 --> 00:44:24,610 And so, now I'm also going to just define this term that we saw earlier, 429 00:44:24,610 --> 00:44:38,172 Q of s, a, how good was an action in a given state, minus V of s, our expected value of how good the state is: we call this term the advantage function. 430 00:44:38,172 --> 00:44:43,568 Right, so the advantage function is how much advantage did we get from playing this action? 431 00:44:43,568 --> 00:44:48,100 How much better the action was than expected. 432 00:44:48,100 --> 00:44:53,457 So, using this, we can put together our full actor-critic algorithm. 433 00:44:53,457 --> 00:44:56,279 And so what this looks like, is that we're going to start 434 00:44:56,279 --> 00:45:03,689 off by initializing our policy parameters theta, and our critic parameters that we'll call phi. 435 00:45:03,689 --> 00:45:11,306 And then for each, for iterations of training, we're going to sample M trajectories, under the current policy.
436 00:45:12,185 --> 00:45:18,725 Right, we're going to play our policy and get these trajectories as s-zero, a-zero, r-zero, s-one and so on. 437 00:45:18,725 --> 00:45:21,671 Okay, and then we're going to compute the gradients that we want. 438 00:45:21,671 --> 00:45:33,465 Right, so for each of these trajectories, and at each time step, we're going to compute this advantage function, and then we're going to use this advantage function, right? 439 00:45:33,465 --> 00:45:42,894 And then we're going to use that in our gradient estimator that we showed earlier, and accumulate our gradient estimate that we have here. 440 00:45:42,894 --> 00:45:50,837 And then we're also going to train our critic parameters phi in exactly the same way, 441 00:45:50,837 --> 00:45:57,557 so as we saw earlier, basically trying to learn our value function, right, 442 00:45:57,557 --> 00:46:10,347 which is going to boil down to just minimizing this advantage function, and this will encourage it to be closer to this Bellman equation that we saw earlier, right? 443 00:46:10,347 --> 00:46:19,361 And so, this is basically just iterating between learning and optimizing our policy function, as well as our critic function. 444 00:46:20,211 --> 00:46:26,727 And so then we're going to update the gradients and then we're going to go through and just continuously repeat this process. 445 00:46:29,271 --> 00:46:31,827 Okay, so now let's look at some examples of REINFORCE 446 00:46:31,827 --> 00:46:39,464 in action, and let's look first here at something called the Recurrent Attention Model, 447 00:46:39,464 --> 00:46:49,146 which is a model also referred to as hard attention, but which you'll see a lot recently in computer vision tasks for various purposes. 448 00:46:49,146 --> 00:46:59,167 Right, and so the idea behind this is, here I'm talking about the original work on hard attention, which is on image classification, and your goal is 449 00:46:59,167 --> 00:47:06,494 to still predict the image class, but now you're going to do this by taking a sequence of glimpses around the image. 450 00:47:06,494 --> 00:47:10,300 You're going to look at local regions around the image, 451 00:47:10,300 --> 00:47:17,141 and you're basically going to selectively focus on these parts and build up information as you're looking around. 452 00:47:17,141 --> 00:47:24,551 Right, and so the reason that we want to do this is, well, first of all it has some nice inspiration from human perception and eye movement. 453 00:47:24,551 --> 00:47:29,225 Let's say we're looking at a complex image and we want to determine what's in the image. 454 00:47:29,225 --> 00:47:39,168 Well, you know, we might maybe look at a low-resolution version of it first, and then look specifically at parts of the image that will give us clues about what's in this image. 455 00:47:39,168 --> 00:47:49,276 And then, this approach of just looking around at an image at local regions is also going to help you save computational resources, right? 456 00:47:50,435 --> 00:47:53,293 You don't need to process the full image. 457 00:47:53,293 --> 00:48:03,576 In practice, what usually happens is you look at a low-resolution version of the full image first, to decide how to get started, and then you look at high-res portions of the image after that.
458 00:48:04,671 --> 00:48:15,177 And so this saves a lot of computational resources, and you can think about the benefits of this for scalability, right, being able to, let's say, process larger images more efficiently. 459 00:48:16,164 --> 00:48:20,099 And then, finally, this could also help with actual classification performance, 460 00:48:20,099 --> 00:48:24,760 because now you're able to ignore clutter and irrelevant parts of the image. 461 00:48:24,760 --> 00:48:29,931 Right? Like, you know, instead of always putting all parts of your image through your ConvNet, 462 00:48:29,931 --> 00:48:36,353 you can use this to first prune down to the relevant parts that you actually want to process with your ConvNet. 463 00:48:37,237 --> 00:48:41,531 Okay, so what's the reinforcement learning formulation of this problem? 464 00:48:41,531 --> 00:48:51,117 Well, our state is going to be the glimpses that we've seen so far, right? What's the information that we've seen? 465 00:48:51,117 --> 00:48:55,228 Our action is then going to be where to look next in the image. 466 00:48:55,228 --> 00:49:02,842 Right, so in practice this can be something like the x, y-coordinates of the center of some fixed-size glimpse that you want to look at next. 467 00:49:02,842 --> 00:49:12,423 And then the reward for the classification problem is going to be one, at the final time step, if our image is correctly classified, and zero otherwise. 468 00:49:14,495 --> 00:49:24,550 And so, because taking these glimpses around the image is a non-differentiable operation, this is why we need to use a reinforcement learning formulation 469 00:49:25,761 --> 00:49:31,792 and learn policies for how to take these glimpse actions, and we can train this using REINFORCE. 470 00:49:31,792 --> 00:49:40,891 So, given the glimpses seen so far, the core of our model is going to be an RNN that we're going to use to model the state, 471 00:49:40,891 --> 00:49:47,418 and then we're going to use our policy parameters in order to output the next action. 472 00:49:49,354 --> 00:49:54,571 Okay, so what this model looks like is: we're going to take an input image, 473 00:49:54,571 --> 00:50:03,184 right, and then we're going to take a glimpse of this image. So here, this glimpse is the red box, and it's all blank, zeros. 474 00:50:03,184 --> 00:50:12,193 And so we'll pass what we've seen so far into some neural network, and this can be any kind of network depending on your task. 475 00:50:12,193 --> 00:50:18,758 In the original experiments that I'm showing here, on MNIST, this is very simple, so you can just use a couple of small, fully-connected layers, 476 00:50:18,758 --> 00:50:26,105 but you can imagine that for more complex images and other tasks you may want to use fancier ConvNets. 477 00:50:26,105 --> 00:50:28,775 Right, so you've passed this into some neural network, 478 00:50:28,775 --> 00:50:36,115 and then, remember, I said we're also going to be integrating our state of glimpses seen so far using a recurrent network. 479 00:50:36,115 --> 00:50:40,265 So we'll see that in a moment, but this is going to go through that recurrent network, 480 00:50:40,265 --> 00:50:46,094 and then it's going to output the x, y-coordinates of where I'm going to look next.
481 00:50:46,094 --> 00:50:50,766 And in practice, we want to output a distribution over actions, 482 00:50:50,766 --> 00:50:57,282 right, and so this is going to be a Gaussian distribution, and we're going to output its mean. 483 00:50:57,282 --> 00:51:03,944 In practice you can also output both a mean and a variance for this distribution, or the variance can be fixed. 484 00:51:03,944 --> 00:51:11,854 Okay, so now we're going to take this action: we're going to sample a specific x, y location from our action distribution, 485 00:51:11,854 --> 00:51:17,777 and then we're going to use this to extract the next glimpse from our image. 486 00:51:17,777 --> 00:51:23,385 Right, so here we've moved to the end of the two, this tail part of the two. 487 00:51:23,385 --> 00:51:26,745 And so now we're actually starting to get some signal of what we want to see, right? 488 00:51:26,745 --> 00:51:32,724 What we want to do is look at the relevant parts of the image that are useful for classification. 489 00:51:32,724 --> 00:51:37,104 So we pass this through, again, our neural network layers, and then also through 490 00:51:38,153 --> 00:51:47,343 our recurrent network, right, which takes this input as well as the previous hidden state, and this is representing our policy, 491 00:51:47,343 --> 00:51:54,095 and then we're going to use this to output our distribution for the next location that we want to glimpse at. 492 00:51:54,095 --> 00:51:57,303 So we can continue doing this, and you can see in this next glimpse here, 493 00:51:57,303 --> 00:51:59,903 we've moved a little bit more toward the center of the two. 494 00:51:59,903 --> 00:52:14,543 Alright, so it's probably learning that, you know, once I've seen this tail part of the two that looks like this, maybe moving in the upper left-hand direction will get me more towards the center, which will also have valuable information. 495 00:52:14,543 --> 00:52:17,473 And then we can keep doing this. 496 00:52:17,473 --> 00:52:26,795 And then finally we get to our last time step, so we can have a fixed number of time steps here, in practice something like six or eight. 497 00:52:26,795 --> 00:52:33,350 And at the final time step, since we want to do classification, we'll have our standard 498 00:52:33,350 --> 00:52:39,363 softmax layer that will produce a probability distribution over the classes. 499 00:52:39,363 --> 00:52:44,108 And here the max class was a two, so we can predict that this was a two. 500 00:52:44,108 --> 00:52:50,428 Right, and so this is going to be the set-up of our model and our policy, and then we have our 501 00:52:50,428 --> 00:52:59,569 estimate for the gradient of this policy, which, as we said earlier, we can compute by sampling trajectories and using those to do backprop. 502 00:52:59,569 --> 00:53:08,698 And so we can just do this in order to train this model and learn the parameters of our policy, right? All of the weights that you can see here. 503 00:53:09,953 --> 00:53:15,257 Okay, so here's an example of policies trained on MNIST, 504 00:53:16,710 --> 00:53:27,685 and you can see that, in general, from wherever it starts, it usually learns to move closer to where the digit is, and then to look at the relevant parts of the digit, right?
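As a rough sketch of the glimpse loop just described, here is what the core of such a model might look like in PyTorch. Everything here is an illustrative assumption rather than the original paper's implementation: the extract_glimpse cropping helper, the fixed glimpse size, the small fully-connected glimpse encoder, the GRU core, the Gaussian location policy with a fixed standard deviation, and the final softmax classifier.

import torch
import torch.nn as nn

class RecurrentAttention(nn.Module):
    # Illustrative hard-attention sketch: glimpse -> RNN state -> next (x, y) location -> class.
    def __init__(self, glimpse_size=8, hidden=128, n_classes=10, loc_std=0.1):
        super().__init__()
        self.glimpse_size = glimpse_size
        self.loc_std = loc_std                                   # fixed std for the location Gaussian
        self.glimpse_net = nn.Sequential(                        # small fully-connected encoder (MNIST-scale)
            nn.Linear(glimpse_size * glimpse_size + 2, hidden), nn.ReLU())
        self.rnn = nn.GRUCell(hidden, hidden)                    # integrates the glimpses seen so far (the state)
        self.loc_head = nn.Linear(hidden, 2)                     # mean of the Gaussian over the next (x, y)
        self.cls_head = nn.Linear(hidden, n_classes)             # classifier used at the final time step

    def forward(self, image, n_glimpses=6):
        B = image.size(0)
        h = image.new_zeros(B, self.rnn.hidden_size)
        loc = image.new_zeros(B, 2)                              # start at the image center
        log_probs = []
        for _ in range(n_glimpses):
            patch = extract_glimpse(image, loc, self.glimpse_size)   # assumed crop helper, returns [B, g, g]
            g = self.glimpse_net(torch.cat([patch.flatten(1), loc], dim=1))
            h = self.rnn(g, h)
            mean = torch.tanh(self.loc_head(h))                  # policy: where to look next, in [-1, 1] coords
            dist = torch.distributions.Normal(mean, self.loc_std)
            loc = dist.sample()
            log_probs.append(dist.log_prob(loc).sum(dim=1))      # kept for the REINFORCE gradient
        return self.cls_head(h), torch.stack(log_probs, dim=1)

# Training sketch: reward is 1 at the final step if the prediction is correct, 0 otherwise,
# and the REINFORCE term scales the glimpse log-probs by that (detached) reward:
#   logits, log_probs = model(images)
#   reward = (logits.argmax(dim=1) == labels).float()
#   loss = nn.functional.cross_entropy(logits, labels) \
#          - (reward.detach().unsqueeze(1) * log_probs).sum(dim=1).mean()

In practice a learned baseline would typically be subtracted from the reward to reduce variance, in the same spirit as the baselines discussed earlier in the lecture.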
So this is pretty cool, and 505 00:53:27,685 --> 00:53:30,460 this, you know, follows kind of what you would expect, right, 506 00:53:30,460 --> 00:53:38,400 if you were to choose places to look next in order to most efficiently determine what digit this is. 507 00:53:40,108 --> 00:53:49,758 Right, and so this idea of hard attention, of recurrent attention models, has also been used in a lot of tasks in computer vision in the last 508 00:53:49,758 --> 00:53:54,790 couple of years, so you'll see this used, for example, in fine-grained image recognition. 509 00:53:54,790 --> 00:54:01,763 So, I mentioned earlier that one of the useful benefits of this can be to 510 00:54:02,975 --> 00:54:10,180 both save on computation as well as ignore clutter and irrelevant parts of the image, and when you have fine-grained 511 00:54:10,180 --> 00:54:13,092 image classification problems, you usually want both of these. 512 00:54:13,092 --> 00:54:25,777 You want to keep high resolution, so that you can look at, you know, important differences, and then you also want to focus on these differences and ignore irrelevant parts. 513 00:54:25,777 --> 00:54:27,359 Yeah, question. 514 00:54:27,359 --> 00:54:31,526 [inaudible question from audience] 515 00:54:35,061 --> 00:54:41,482 Okay, so yeah, the question is how there is computational efficiency, given that we also have this recurrent neural network in place. 516 00:54:41,482 --> 00:54:47,761 So that's true, it depends on exactly what your problem is, what your network is, and so on, 517 00:54:47,761 --> 00:54:52,512 but you can imagine that if you had some really high-resolution image 518 00:54:52,512 --> 00:55:01,900 and you don't want to process the entire image with some huge ConvNet or some huge, you know, network, now you can get some savings by just 519 00:55:01,900 --> 00:55:04,669 focusing on specific smaller parts of the image. 520 00:55:04,669 --> 00:55:06,589 You only process those parts of the image. 521 00:55:06,589 --> 00:55:10,924 But you're right that it depends on exactly what problem set-up you have. 522 00:55:12,210 --> 00:55:14,530 This has also been used in image captioning, 523 00:55:14,530 --> 00:55:17,138 so if we're going to produce a caption for an image, 524 00:55:17,138 --> 00:55:23,197 we can have the model use this attention approach to generate the caption, 525 00:55:23,197 --> 00:55:28,999 and what it usually ends up learning is these policies where it'll focus on specific parts of the image 526 00:55:28,999 --> 00:55:38,341 in sequence, and as it focuses on each part, it'll generate the words or the part of the caption referring to that part of the image. 527 00:55:38,341 --> 00:55:42,948 And then it's also been used in tasks such as visual question answering, 528 00:55:42,948 --> 00:55:51,786 where we ask a question about the image and we want the model to output some answer to the question, for example, I don't know, 529 00:55:51,786 --> 00:55:53,840 how many chairs are around the table? 530 00:55:53,840 --> 00:56:03,040 And so you can see how this attention mechanism might be a good type of model for learning how to answer these questions. 531 00:56:05,707 --> 00:56:10,564 Okay, so that was an example of policy gradients used in these hard attention models. 532 00:56:10,564 --> 00:56:16,430 And now I'm going to talk about one more example that also uses policy gradients, 533 00:56:16,430 --> 00:56:18,465 which is learning how to play Go.
534 00:56:18,465 --> 00:56:24,497 Right, so DeepMind has this agent for playing Go, called AlphaGo, 535 00:56:24,497 --> 00:56:30,708 that's been in the news a lot over the past year and this year. 536 00:56:30,708 --> 00:56:31,541 So, sorry? 537 00:56:31,541 --> 00:56:32,374 [inaudible comment from audience] 538 00:56:32,374 --> 00:56:39,258 And yesterday, yes, that's correct. So this is very exciting, recent news as well. 539 00:56:39,258 --> 00:56:57,886 So last year, a first version of AlphaGo was put into a competition against one of the best Go players of recent years, Lee Sedol, and the agent was able to beat him four to one in a five-game match. 540 00:56:57,886 --> 00:57:09,348 And actually, right now, there's another match with Ke Jie, who is the current world number one, and it's best of three, in China, right now. 541 00:57:09,348 --> 00:57:13,436 And so the first game was yesterday. AlphaGo won. 542 00:57:13,436 --> 00:57:20,657 I think it was by just half a point, and so there are two more games to watch. These are all live-streamed, so 543 00:57:20,657 --> 00:57:28,193 you guys should also go online and watch these games. It's pretty interesting to hear the commentary. 544 00:57:29,225 --> 00:57:32,577 But, so what is this AlphaGo agent from DeepMind? 545 00:57:32,577 --> 00:57:36,466 It's based on a lot of what we've talked about so far in this lecture. 546 00:57:36,466 --> 00:57:42,045 What it is is a mix of supervised learning and reinforcement learning, 547 00:57:42,045 --> 00:57:51,656 as well as a mix of an older method for Go, Monte Carlo Tree Search, with recent deep RL approaches. 548 00:57:52,579 --> 00:57:56,986 So, okay, so how does AlphaGo beat the Go world champion? 549 00:57:56,986 --> 00:58:08,739 Well, to train AlphaGo, what it takes as input is a featurization of the board. So it's basically, right, your board and the positions of the stones on the board. 550 00:58:08,739 --> 00:58:10,739 That's your natural state representation. 551 00:58:10,739 --> 00:58:17,958 And what they do in order to improve performance a little bit is featurize this into some 552 00:58:17,958 --> 00:58:23,790 more channels: one set is the different stone colors, so this is kind of like the configuration of your board, 553 00:58:23,790 --> 00:58:31,138 also some channels for, for example, which moves are legal, some bias channels, various things, 554 00:58:31,138 --> 00:58:36,518 and then, given this state, right, it's going to first train a network 555 00:58:36,518 --> 00:58:40,897 that's initialized with supervised training from professional Go games. 556 00:58:40,897 --> 00:58:48,495 So, given the current board configuration, or a featurization of it, what's the correct next action to take? 557 00:58:48,495 --> 00:58:55,608 Alright, so given examples of professional games played, you know, just collected over time, 558 00:58:55,608 --> 00:59:02,605 we can take all of these professional Go moves and train a standard supervised mapping from board state to the action to take. 559 00:59:02,605 --> 00:59:05,365 Alright, so they take this, which is a pretty good start, 560 00:59:05,365 --> 00:59:09,227 and then they're going to use this to initialize a policy network. 561 00:59:09,227 --> 00:59:17,778 Right, so the policy network is just going to have the exact same structure: the input is your board state and the output is the action that you're going to take.
562 00:59:17,778 --> 00:59:21,978 And this was the set-up for the policy gradients that we just saw, right? 563 00:59:21,978 --> 00:59:27,130 So now we're going to just continue training this using policy gradients. 564 00:59:27,130 --> 00:59:35,123 And it's going to do this reinforcement learning training by playing against random previous iterations of itself. 565 00:59:35,123 --> 00:59:42,243 So self-play, and the reward it's going to get is one if it wins and negative one if it loses. 566 00:59:42,243 --> 00:59:47,573 And we're also going to learn a value network, so, something like a critic. 567 00:59:47,573 --> 01:00:01,475 And then the final AlphaGo is going to combine all of these together, so policy and value networks along with a Monte Carlo Tree Search algorithm, in order to select actions by lookahead search. 568 01:00:01,475 --> 01:00:19,891 Right, so after putting all this together, the value of a node, of where you are in play and what you do next, is going to be a combination of your value function as well as the rollout outcomes that you compute from standard Monte Carlo Tree Search rollouts. 569 01:00:19,891 --> 01:00:27,453 Okay, so, yeah, these are basically the components of AlphaGo. 570 01:00:28,397 --> 01:00:33,814 If you're interested in reading more about this, there's a Nature paper about it from 2016, 571 01:00:34,664 --> 01:00:47,953 and they trained this over, I think, the version of AlphaGo that's being used in these matches uses something like a couple thousand CPUs plus a couple hundred GPUs, putting all of this together, 572 01:00:47,953 --> 01:00:52,120 so it's a huge amount of training that's going on, right. 573 01:00:55,659 --> 01:01:01,529 And yeah, so you guys should follow the games this week. It's pretty exciting. 574 01:01:03,491 --> 01:01:10,858 Okay, so in summary, today we've talked about policy gradients, which are very general. 575 01:01:10,858 --> 01:01:18,855 You're just directly doing gradient ascent on your policy parameters, 576 01:01:18,855 --> 01:01:23,947 so this works well for a large class of problems, but it also suffers from high variance, 577 01:01:23,947 --> 01:01:28,669 so it requires a lot of samples, and the challenge here is sample efficiency. 578 01:01:28,669 --> 01:01:35,349 We also talked about Q-learning, which doesn't always work; it's sometimes harder to get it to work 579 01:01:35,349 --> 01:01:46,730 because of this problem that we talked about earlier, where you're trying to estimate this exact state-action value over very high-dimensional spaces, but when it does work, 580 01:01:47,769 --> 01:01:54,322 for problems like, for example, the Atari games we saw earlier, then it's usually more sample-efficient than policy gradients. 581 01:01:54,322 --> 01:01:59,902 Right, and one of the challenges in Q-learning is making sure that you're doing sufficient exploration. 582 01:01:59,902 --> 01:02:04,902 Yeah? [inaudible question from audience] 583 01:02:14,313 --> 01:02:21,753 Oh, so the question is, for Q-learning, can you do this process where you're trying to start it off with some supervised training?
584 01:02:21,753 --> 01:02:24,924 So, I guess the direct approach for Q-learning doesn't 585 01:02:24,924 --> 01:02:27,532 do that, because you're trying to regress to these 586 01:02:27,532 --> 01:02:46,021 Q-values, right, instead of taking policy gradients over this distribution, but I think there are ways in which you can massage this type of thing to also bootstrap, because I think bootstrapping in general, or behavior cloning, is a good way to warm-start these policies. 587 01:02:47,454 --> 01:02:54,213 Okay, so, right, we've talked about policy gradients and Q-learning, so here's just another look at 588 01:02:54,213 --> 01:02:56,752 some of the guarantees that you have with policy gradients. 589 01:02:56,752 --> 01:03:02,789 One thing we do know that's really nice is that this will always converge to a local optimum of J of theta, 590 01:03:04,339 --> 01:03:12,131 because we're just directly doing gradient ascent, and this local optimum is often pretty good, right. 591 01:03:12,131 --> 01:03:17,041 In Q-learning, on the other hand, we don't have any guarantees, because here we're trying to approximate 592 01:03:17,041 --> 01:03:23,358 the Bellman equation with a complicated function approximator, and so, in this case, this is why 593 01:03:23,358 --> 01:03:29,954 Q-learning can be a little bit trickier to train, in terms of applicability to a wide range of problems. 594 01:03:31,849 --> 01:03:41,546 Alright, so today you got a very brief, kind of high-level overview of reinforcement learning and some major classes of algorithms in RL. 595 01:03:41,546 --> 01:03:47,577 And next time we're going to have a guest lecturer, Song Han, who's done a lot 596 01:03:47,577 --> 01:03:52,569 of pioneering work in model compression and energy-efficient deep learning, 597 01:03:52,569 --> 01:03:56,459 and so he's going to talk about some of this. 598 01:03:56,459 --> 01:03:58,459 Thank you.